A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include a change of plans, scheduling conflicts, and similar circumstances. Cancelling is often made easier by the option to do so free of charge, or at a low cost, which benefits hotel guests but is an undesirable and potentially revenue-diminishing factor for hotels. Such losses are particularly high on last-minute cancellations.
New technologies, particularly online booking channels, have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal, they are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. You as a data scientist have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
from datetime import date
import math  # used to determine the subplot grid size during visualization
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# set size of the seaborn plots
sns.set(rc = {'figure.figsize':(15,8)})
# setting the precision of floating numbers to 4 decimal points
pd.set_option("display.float_format", lambda x: "%.4f" % x)
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Scikit-learn library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
# Libraries to build decision tree classifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn import metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
roc_auc_score,
roc_curve,
precision_recall_curve,
confusion_matrix,
make_scorer,
)
# for mounting Google Drive to the notebook (to be used only if executing in Google Colab)
from google.colab import drive
drive.mount('/content/drive')
INN = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Projects/Project 4/INNHotelsGroup.csv')
# copying data to another variable to avoid any changes to original data
data = INN.copy()
def histogram_boxplot(data, feature, figsize=(15, 10), kde=True, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (15,10))
kde: whether to show the density curve (default True)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 2, 5))
else:
plt.figure(figsize=(n + 2, 5))
plt.xticks(rotation=90, fontsize=10)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n],
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # x-coordinate: center of the bar
y = p.get_height() # y-coordinate: height of the bar
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=10,
xytext=(0, 3),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
data.head()
data.tail()
#check shape of the datasets
print("INN Data has: ",data.shape[0],'rows and', data.shape[1], 'columns')
data.duplicated().sum()
data[data.duplicated()].count()
print("Unique Values INN Dataset: \n\n",data.nunique())
data.dtypes
data.info()
# check if there are missing values
data.isnull().sum().sort_values(ascending=False)
Observation of Data Check
data.describe(include="all").T
#check summary object statistics
data.describe(include = ['object'])
# filtering object type columns
cat_columns = data.describe(include=["object"]).columns
cat_columns
for i in cat_columns:
print(data[i].value_counts())
print("*" * 40)
print("\n")
Observation of Categorical Statistical Summary
What are the busiest months in the hotel? Answer : A4 - October with 5317 bookings (14.7%)
Which market segment do most of the guests come from? Answer : A3 - Online with a count of 23,214 bookings (64%)
Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments? Answer : B3 - The Online segment has the highest average price per room (112.25), followed by the Aviation (100.70) and Offline (91.63) market segments.
What percentage of bookings are canceled? Answer : A2 - 11,885 bookings, which make up 32.8%, are cancelled.
Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel? Answer : B4 - Only 16 (1.72%) of the 930 bookings made by repeat guests were cancelled.
Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation? Answer : B12 - It seems that as the number of special requests increases, the likelihood of a booking being cancelled decreases.
sns.set_style("darkgrid")
data.hist(figsize=(20, 20))
plt.show()
data['booking_status'].value_counts()
labeled_barplot(data, "booking_status", perc=True)
~ 24,390 bookings (67.2%) were not cancelled
~ 32.8% of bookings were cancelled, which is quite high, and we should look into ways of lowering this figure (Answer to Leading Question)
data['market_segment_type'].value_counts()
labeled_barplot(data, "market_segment_type", perc=True)
~ The highest booking type/market segment is Online with a count of 23,214 bookings (64%)
data['arrival_month'].value_counts()
labeled_barplot(data, "arrival_month", perc=True, n = None)
~ The busiest month recorded for INN Hotels is October with 5317 bookings (14.7%)
~ The next busiest months are : September, August, and June (Descending Order)
~ Least favorable month is January (2.8%)
data['type_of_meal_plan'].value_counts()
labeled_barplot(data, "type_of_meal_plan", perc=False, n = None)
~ The most common meal preference among guests is Meal Plan 1 (77%); Meal Plan 3 is the least common by a wide margin
data['room_type_reserved'].value_counts(ascending = False)
labeled_barplot(data, "room_type_reserved", perc=True)
~ Highest guest room reservations are for Room Type 1
data['no_of_adults'].value_counts()
labeled_barplot(data, "no_of_adults", perc=True)
data['no_of_children'].value_counts()
labeled_barplot(data, "no_of_children", perc=True)
The highest percentage of hotel bookings do not include children
data['repeated_guest'].value_counts()
labeled_barplot(data, 'repeated_guest', perc=True)
According to the data, there were 930 customers (or 2.6% of the total) who were repeated guests, meaning they had stayed at the INN Hotel Group in Portugal at least once before.
data['no_of_weekend_nights'].value_counts()
labeled_barplot(data, "no_of_weekend_nights", perc=True)
data['no_of_week_nights'].value_counts()
labeled_barplot(data, "no_of_week_nights", perc=True)
# combine the no_of_weekend_nights and no_of_week_nights
# we will drop this in the pre-processing stage
data['total_stay'] = data['no_of_weekend_nights'] + data['no_of_week_nights']
data['total_stay'].value_counts()
Observation
Note :
no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
data['required_car_parking_space'].value_counts()
labeled_barplot(data, "required_car_parking_space", perc=True)
Most of the guests at the hotel DO NOT NEED a car parking space. Several factors could contribute to this, such as international travellers or business visits where driving may not be required, or even cost, where hourly parking charges may be high.
data['no_of_special_requests'].value_counts()
histogram_boxplot(data, "no_of_special_requests")
labeled_barplot(data, "no_of_special_requests", perc=True)
data['no_of_previous_cancellations'].value_counts()
histogram_boxplot(data, "no_of_previous_cancellations")
labeled_barplot(data, "no_of_previous_cancellations", perc=True)
99.1% of the customers that made bookings had no records of previous cancellations
data['avg_price_per_room'].describe()
histogram_boxplot(data, "avg_price_per_room", kde=True)
data['lead_time'].describe()
histogram_boxplot(data, "lead_time", kde=True)
# Let's find out how many free rooms the hotel gives away
# count the number of records where the avg_price_per_room is equal to 0.00
free_rooms = data[data['avg_price_per_room'] == 0.00]
# Count the number of free rooms and print the result
num_free_rooms = free_rooms.shape[0]
print("INN hotel gave away {} free rooms.".format(num_free_rooms))
# Count the number of free rooms offered by each room type and print the result
free_rooms_by_type = free_rooms.groupby('room_type_reserved').size()
print(free_rooms_by_type)
I plan to do the correlation map at the start itself, with the original DataFrame, to eliminate the need to drop additional variables created for EDA
data["booking_status"] = data["booking_status"].apply(lambda x: 1 if x=="Canceled" else 0)
# creating a list of numeric columns
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# correlation heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(
data[numeric_columns].corr(),
annot=True,
vmin=-1,
vmax=1,
fmt=".2f",
cmap="Spectral",
);
Observation of Correlation
# lets see what could be the average room rate for cancelled bookings
avg_rate_cancelled = round(data[data['booking_status'] == 1]['avg_price_per_room'].mean(),2)
print('Average room rate for cancelled bookings (in Euros):',avg_rate_cancelled)
# create a new column for total number of nights
data['total_nights'] = data['total_stay']
# find the maximum value of the total nights column
longest_stay = data['total_nights'].max()
# print the result
print('Longest reservation booking made by a guest is: ',longest_stay, "night(s)", "\n")
# find the customer IDs with the longest reservation booking
customer_ids = data[data['total_nights'] == longest_stay]['Booking_ID']
# print the customerID result
print('Customer IDs with the longest reservation booking are:',customer_ids, "\n")
# Group the data by no_of_weekend_nights and calculate the sum of no_of_week_nights
week_nights_by_weekend_nights = data.groupby('no_of_weekend_nights')['no_of_week_nights'].sum()
# Print the results
print(week_nights_by_weekend_nights)
# Group the data by market segment and calculate the average price per room
avg_price_by_segment = data.groupby('market_segment_type')['avg_price_per_room'].mean()
# Reset the index to make sure it is unique
avg_price_by_segment = avg_price_by_segment.reset_index()
# Print the results
print(avg_price_by_segment)
sns.boxplot(data=data, y='avg_price_per_room' , x='market_segment_type');
The boxplot visualization shows the average price per room for each market segment at INN hotel.
# Count the number of repeating guests who canceled their bookings
repeating_guest_cancelled = data[(data['repeated_guest'] == 1) & (data['booking_status'] == 1)]['booking_status'].count()
# Count the total number of repeating guests
total_repeating_guests = data[data['repeated_guest'] == 1]['repeated_guest'].count()
# Calculate the percentage of repeating guests who canceled their bookings
percentage_cancelled = (repeating_guest_cancelled / total_repeating_guests) * 100
#print("Percentage of repeating guests who canceled their bookings:", round(percentage_cancelled, 2), "%")
print(f"{percentage_cancelled:.2f}% of repeating guests cancelled.")
stacked_barplot(data,'repeated_guest','booking_status')
~ Bookings from recurring guest who cancelled is quite low - 1.72% only or [ (16/930) * 100 = 1.72% ]
From this table, we can see that repeated guests are much less likely to cancel their bookings compared to non-repeated guests (only 1.7% of bookings by repeated guests were canceled, compared to 33.5% of bookings by non-repeated guests).
stacked_barplot(data, "no_of_adults", "booking_status")
stacked_barplot(data, "no_of_children", "booking_status")
From these observations, we can see that bookings with no children have the highest number of both not-canceled and canceled bookings.
Furthermore, the number of bookings with 9 or 10 children is very low (only 2 bookings), so the observation may not be representative of the population.
# to create a table showing count of repeating guests and the market segment
table = pd.crosstab(data['repeated_guest'], data['market_segment_type'])
print(table)
Observation
''' to find out the median, lowest, and highest lead time (number of days between the date of booking and the arrival date)
for bookings whose status is cancelled
'''
cancelled_bookings = data[data['booking_status'] == 1]
cancelled_lead_times = cancelled_bookings['lead_time']
print(cancelled_lead_times.describe())
The maximum lead time for cancelled bookings is 443 days, meaning that some bookings were cancelled more than a year in advance.
Suggestions
# this is to print the count for the number of weekday bookings only (ie. weekend_nights ==0)
# and to determine the count of only the cancelled booking status
count_booking_status = data[data['no_of_weekend_nights'] == 0]['booking_status'].value_counts()
total_bookings = count_booking_status.sum()
percentages = count_booking_status / total_bookings * 100
print(count_booking_status,"\n")
print(percentages)
# plot the stacked bar plot
ax = count_booking_status.plot(kind='bar', stacked=True, figsize=(10, 6))
# set the labels and title
ax.set_xlabel('Booking Status')
ax.set_ylabel('Count')
ax.set_title('Booking Status for Weekday-only Bookings (no weekend nights)')
# show the plot
plt.show()
# Filter the DataFrame to include only rows where no_of_weekend_nights > 0
weekend_bookings = data[data['no_of_weekend_nights'] > 0]
# Count the number of rows in the filtered DataFrame
num_weekend_bookings = len(weekend_bookings)
# Print the number of bookings that include at least one weekend night
print(f"The total number of bookings that include weekend nights is : {num_weekend_bookings}","\n")
counts_weekend = data[data['no_of_weekend_nights'] > 0]['booking_status'].value_counts()
print(counts_weekend)
# to show the count for the rows where weekend_nights > 0 (ie. includes weekend nights) where bookings are canceled
counts = data[(data['no_of_weekend_nights'] > 0) & (data['booking_status'] == 1)].groupby(['no_of_weekend_nights', 'booking_status']).size().reset_index(name='counts')
# Print the counts for each combination of no_of_weekend_nights and booking_status
print(counts)
canceled_weekend_bookings = data[(data['no_of_weekend_nights'] > 0) & (data['booking_status'] == 1)]
# Plot the count of canceled bookings on weekends
sns.countplot(x='no_of_weekend_nights', data=canceled_weekend_bookings)
plt.title('Canceled Bookings on Weekends')
plt.xlabel('Number of Weekend Nights')
plt.ylabel('Count')
plt.show()
plt.figure(figsize=(10, 5))
sns.lineplot(data=data, x='arrival_month', y='avg_price_per_room')
plt.show()
stacked_barplot(data, "market_segment_type", "booking_status")
Bookings via the Offline and Aviation market segments show approximately equal cancellation rates (~30%), at 29.9% and 29.6% respectively.
# create the cross-tabulation
booking_special_req = pd.crosstab(data['booking_status'], data['no_of_special_requests'])
# print the cross-tabulation
print(booking_special_req)
stacked_barplot(data, "no_of_special_requests", "booking_status")
distribution_plot_wrt_target(data, "booking_status", "no_of_special_requests")
Overall, this suggests that customers who made special requests with their bookings are more likely to follow through with their bookings.
stacked_barplot(data, "required_car_parking_space", "booking_status")
stacked_barplot(data, "type_of_meal_plan", "booking_status")
stacked_barplot(data, "room_type_reserved", "booking_status")
# Create a data frame with only cancelled bookings
cancelled_bookings = data[data["booking_status"] == 1]
# Summary of cancelled bookings
cancelled_bookings.describe().T
distribution_plot_wrt_target(data, "avg_price_per_room", "booking_status")
Observations:
distribution_plot_wrt_target(data, "lead_time", "booking_status")
sns.set(rc = {'figure.figsize':(15,8)})
sns.regplot(data=data, x = "no_of_special_requests", y = "avg_price_per_room");
Observations:
# check the lead time for repeating guests who canceled
repeating_guests_canceled = data[(data['repeated_guest'] == 1) & (data['booking_status'] == 1)]
lead_time_repeating_guests_canceled = repeating_guests_canceled['lead_time']
print(lead_time_repeating_guests_canceled.describe())
#a table showing the different booking status with lead time of repeating guests and non-repeating guests,
pivot_table = pd.pivot_table(data, values='lead_time', index=['booking_status'],
columns=['repeated_guest'], aggfunc='mean')
print(pivot_table)
Observation
This suggests that, on average, repeating guests who canceled their bookings tend to book further in advance compared to other guests.
This again suggests that, on average, repeating guests who cancel their bookings tend to have a longer lead time compared to non-repeating guests who cancel, while repeating guests who do not cancel tend to have a shorter lead time compared to non-repeating guests who do not cancel.
# calculate number of people in a room
data['no_of_people'] = data['no_of_adults'] + data['no_of_children']
# filter data for cancelled bookings
cancelled_bookings = data[data['booking_status'] == 1]
# calculate average price per room for cancelled bookings
avg_price_cancelled = cancelled_bookings['avg_price_per_room'].mean()
# calculate the number of people in a room for cancelled bookings
people_cancelled = cancelled_bookings['no_of_people'].mean()
print('Average price per room for cancelled bookings (in Euros):', round(avg_price_cancelled,2))
print('Average number of people in a room for cancelled bookings:', round(people_cancelled,2))
# find out the number of adults and children, and the average room rate where bookings are cancelled.
cancelled_bookings = data[data['booking_status'] == 1]
cancelled_bookings_by_people = cancelled_bookings.groupby(['no_of_adults', 'no_of_children'])['avg_price_per_room'].mean().reset_index()
print(cancelled_bookings_by_people)
# visualize for the above observation using regression plot
# two separate regression lines, one for the number of adults and one for the number of children.
sns.set(style='ticks', color_codes=True)
# plot regplot
sns.regplot(x='no_of_adults', y='avg_price_per_room', data=cancelled_bookings_by_people, label='adults')
sns.regplot(x='no_of_children', y='avg_price_per_room', data=cancelled_bookings_by_people, label='children')
plt.legend()
plt.xlabel('Number of people in room')
plt.ylabel('Average price per room (in Euros)')
plt.title('Average price per room by number of adults and children in cancelled bookings')
plt.show()
# combine the two regression lines into one using hue, based on the no_of_children values.
# create scatter plot with regression lines
sns.lmplot(x='no_of_adults', y='avg_price_per_room', data=cancelled_bookings_by_people, hue='no_of_children')
# set axis labels and title
plt.xlabel('Number of Adults')
plt.ylabel('Average Price per Room (Euros)')
plt.title('Average Room Price by Number of People in Cancelled Bookings')
# display the plot
plt.show()
Observation
Important to note that there are records with 0 adults and 2 children, or 2 adults and 9 children. These may be outliers.
Let's drop all columns created for EDA purposes, along with unnecessary columns that do not contribute to the modeling
# drop the Booking_ID column
data.drop("Booking_ID", axis=1, inplace=True)
# drop the 'total_stay' column
data.drop(labels=['total_stay'], axis=1, inplace=True)
# drop the 'total_nights' column
data.drop(labels=['total_nights'], axis=1, inplace=True)
# drop the 'no_of_people' column
data.drop(labels=['no_of_people'], axis=1, inplace=True)
# make a copy of the data frame before making more changes
df = data.copy()
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
# Print the list of numerical columns
print(numerical_cols)
print('No of numerical columns are :',len(numerical_cols))
#Let's check for outliers in the data using boxplots
# to prevent errors in defining the number of subplots, we use the ceiling function (math.ceil, imported above) to determine the number of rows needed
n_rows = int(math.ceil(len(numerical_cols)/4))
plt.figure(figsize=(30, n_rows*4))
for i, variable in enumerate(numerical_cols):
plt.subplot(n_rows, 4, i + 1) # n_rows by 4 cols
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout(pad = 2)
plt.title(variable)
plt.show()
Observations:
# functions to treat outliers by flooring and capping
def treat_outliers(df, col):
"""
Treats outliers in a variable
df: dataframe
col: dataframe column
"""
Q1 = df[col].quantile(0.25) # 25th quantile
Q3 = df[col].quantile(0.75) # 75th quantile
IQR = Q3 - Q1
Lower_Whisker = Q1 - 1.5 * IQR
Upper_Whisker = Q3 + 1.5 * IQR
# all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
# all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
return df
def treat_outliers_all(df, col_list):
"""
Treat outliers in a list of variables
df: dataframe
col_list: list of dataframe columns
"""
for c in col_list:
df = treat_outliers(df, c)
return df
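As a quick illustration of the flooring-and-capping logic above, here is a toy series with one obvious outlier (hypothetical numbers, not from the dataset):

```python
import numpy as np
import pandas as pd

# toy series: 100 is an obvious outlier
s = pd.Series([1, 2, 3, 4, 5, 100], dtype=float)

q1, q3 = s.quantile(0.25), s.quantile(0.75)    # 2.25 and 4.75
iqr = q3 - q1                                  # 2.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # -1.5 and 8.5

# values beyond the whiskers are capped at the whisker values
capped = np.clip(s, lower, upper)
print(capped.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 8.5]
```

The outlier is pulled down to the upper whisker (8.5) while the in-range values are untouched; this is exactly what `treat_outliers` does per column.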
# list of columns to treat for outliers
treat_out_cols = ["lead_time", "avg_price_per_room", "no_of_week_nights", "no_of_weekend_nights"]
Note :
# create a new data frame after treating outliers in the columns
df1 = treat_outliers_all(df, treat_out_cols)
# let's look at the boxplots to see if the outliers have been treated or not
numeric_columns = df1.select_dtypes(include=np.number).columns.tolist()
n_rows = int(math.ceil(len(numeric_columns)/4))
plt.figure(figsize=(30, n_rows*4))
for i, variable in enumerate(numeric_columns):
plt.subplot(n_rows, 4, i + 1) # n_rows by 4 cols
plt.boxplot(df1[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# converting initial object data types to categorical data types
df1["type_of_meal_plan"] = df1["type_of_meal_plan"].astype("category")
df1["room_type_reserved"] = df1["room_type_reserved"].astype("category")
df1["market_segment_type"] = df1["market_segment_type"].astype("category")
df1.describe().T
df1.info()
Data Preparation for modeling
# defining X and y variables (define the dependent and independent variables)
X = df1.drop(["booking_status"], axis=1)
Y = df1["booking_status"]
print(X.head()) # view independent variables
print(Y.head()) # view dependent variables
# let's add the intercept to data
X = sm.add_constant(X)
# creating dummy variables
# The get_dummies() function used creates new columns for each unique value in a categorical variable,
# assigning a value of 1 or 0 to indicate its presence or absence
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
X.head()
'''
In machine learning, numerical values are preferred as input features
because algorithms often require mathematical operations. The astype(float) function call is
used to convert all data types in the X dataframe to float, ensuring that all variables
are numeric and can be used as input features in machine learning models.
'''
# to ensure all variables are of float type
X = X.astype(float)
X.head()
# splitting the data in 70:30 ratio for train to test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
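If the class proportions were to drift noticeably between the two splits, `train_test_split` also accepts a `stratify` argument that preserves the class ratio in both sets. A minimal sketch on synthetic labels (not the hotel data):

```python
from sklearn.model_selection import train_test_split

# synthetic: 70 negatives, 30 positives
X_toy = [[i] for i in range(100)]
y_toy = [0] * 70 + [1] * 30

Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)
# both splits keep the original 70/30 class ratio
print(sum(ytr) / len(ytr), sum(yte) / len(yte))  # 0.3 0.3
```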
We will now perform logistic regression using statsmodels, a Python module that provides functions for the estimation of many statistical models, as well as for conducting statistical tests, and statistical data exploration.
Using statsmodels, we will be able to check the statistical validity of our model - identify the significant predictors from p-values that we get for each predictor variable.
# fitting logistic regression model on training set
logit = sm.Logit(y_train, X_train.astype(float))
# fitting logistic regression model
lg = logit.fit(disp=False) # setting disp=False will remove the information on number of iterations
# let's print the logistic regression summary
print(lg.summary())
Observation
Both the cases are important as:
If we predict that a booking will not be canceled and the booking gets canceled then the hotel will lose resources and will have to bear additional costs of distribution channels.
If we predict that a booking will get canceled and the booking doesn't get canceled the hotel might not be able to provide satisfactory services to the customer by assuming that this booking will be canceled. This might damage the brand equity.
F1 Score is to be maximized: the greater the F1 score, the better the chances of minimizing both False Negatives and False Positives (i.e., the hotel would want to maximize the F1 score to minimize both false positives and false negatives).
First, let's create functions to calculate the different metrics and the confusion matrix so that we don't have to repeat the same code for each model.
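As a quick arithmetic check of how F1 balances the two error types, consider hypothetical predictions with 3 true positives, 1 false negative, and 1 false positive (toy numbers, not model output):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]  # one FN (2nd position), one FP (6th)

p = precision_score(y_true, y_pred)  # TP/(TP+FP) = 3/4 = 0.75
r = recall_score(y_true, y_pred)     # TP/(TP+FN) = 3/4 = 0.75
f1 = f1_score(y_true, y_pred)        # 2*p*r/(p+r) = 0.75
print(p, r, f1)
```

Because F1 is the harmonic mean of precision and recall, it drops sharply if either false positives or false negatives grow, which is why maximizing it keeps both error types in check.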
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# checking which probabilities are greater than threshold
pred_temp = model.predict(predictors) > threshold
# rounding off the above values to get classes
pred = np.round(pred_temp)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# predicting on training set
y_pred_train = lg.predict(X_train)
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train, y_train)
There are different ways of detecting (or testing for) multicollinearity. One such way is using the Variance Inflation Factor (VIF).
Variance Inflation factor: Variance inflation factors measure the inflation in the variances of the regression coefficients estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient $\beta_k$ is "inflated" by the existence of correlation among the predictor variables in the model.
General rule of thumb: a VIF above 5 suggests moderate multicollinearity, and a VIF above 10 indicates serious multicollinearity.
The purpose of the analysis should dictate which threshold to use.
vif_series = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Observation
X_train1 = X_train.drop("market_segment_type_Online", axis=1)
vif_series2 = pd.Series(
[variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape[1])],
index=X_train1.columns,
)
print("Series after dropping market_segment_type_Online: \n\n{}\n".format(vif_series2))
Observations:
# fitting logistic regression model
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(disp=False)
print("Training performance:")
model_performance_classification_statsmodels(lg1, X_train1, y_train)
The above process can be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and hence using a loop as below will be more efficient.
# initial list of columns
cols = X_train.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
# defining the train set
x_train_aux = X_train[cols]
# fitting the model
model = sm.Logit(y_train, x_train_aux).fit(disp=False)
# getting the p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols
print(selected_features)
X_train2 = X_train1[selected_features]
logit2 = sm.Logit(y_train, X_train2.astype(float))
lg2 = logit2.fit(disp = False)
print(lg2.summary())
Now we observe that there are no features with a p-value greater than 0.05. Hence we shall consider the features in X_train2 as the final ones and lg2 as the final model.
print("Training performance:")
model_performance_classification_statsmodels(lg2, X_train2, y_train)
Based on the coefficients, we can infer that :
The coefficients of no_of_adults, no_of_children, no_of_weekend_nights, no_of_previous_cancellations, avg_price_per_room, the meal-type dummies, and arrival year are all positive, which means that an increase in these variables is associated with a higher probability of cancellation.
The coefficients of some room types, the market segment types, required_car_parking_space, arrival_month, repeated_guest, and no_of_special_requests are negative. This indicates that an increase in these variables is associated with a lower probability of booking cancellation.
Note: A negative coefficient for a variable in a logistic regression model indicates that as the value of that variable increases, the probability of the target outcome (in this case, cancellation of bookings) decreases.
# converting coefficients to odds
odds = np.exp(lg2.params)
# finding the percentage change
perc_change_odds = (np.exp(lg2.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train2.columns).T
An odds ratio greater than 1 indicates that the feature is positively associated with the outcome (i.e., an increase in the feature will increase the odds of cancellation), while an odds ratio less than 1 indicates that the feature is negatively associated with the outcome (i.e., an increase in the feature will decrease the odds of cancellation).
no_of_adults: the odds ratio for "no_of_adults" is 1.1119. Holding all other features constant, for every one-unit increase in the number of adults, the odds of cancellation increase by a factor of 1.1119, i.e., by about 11.19%.
The other odds for the other variables can be explained in a similar way. The odds for the variable "repeated_guest" are 0.0745, which means that the odds of a booking being canceled for a repeated guest are 0.0745 times the odds of a booking being canceled for a guest who is not repeated.
The change_odd% for "repeated_guest" is -92.5511, which means that for every unit increase in the "repeated_guest" variable, the odds of a booking being canceled decrease by 92.5511%.
In simpler terms, being a repeated guest has a negative effect on the odds of a booking being canceled, meaning that repeated guests are less likely to cancel their bookings compared to first-time guests.
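To connect odds back to probabilities: a coefficient β shifts the log-odds by β per unit, which multiplies the odds by e^β; the resulting change in probability depends on the baseline. A small sketch (the coefficient 0.105 and the baseline log-odds of -1.0 are illustrative values, not taken from the fitted model):

```python
import numpy as np

beta = 0.105  # hypothetical logistic coefficient for a numeric feature

# a one-unit increase multiplies the odds by exp(beta)
odds_ratio = np.exp(beta)
pct_change = (odds_ratio - 1) * 100

# log-odds -> probability via the sigmoid: p = 1 / (1 + exp(-log_odds))
base_log_odds = -1.0  # assumed baseline log-odds
p0 = 1 / (1 + np.exp(-base_log_odds))
p1 = 1 / (1 + np.exp(-(base_log_odds + beta)))

print(round(odds_ratio, 4), round(pct_change, 2))  # 1.1107 11.07
print(round(p0, 4), round(p1, 4))  # probability rises, but not by a fixed amount
```

This is why odds ratios, rather than probability differences, are the natural unit of interpretation for logistic regression coefficients: the multiplicative effect on the odds is constant, while the effect on the probability is not.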
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_train2, y_train)
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg2, X_train2, y_train
)
print("Training performance:")
log_reg_model_train_perf
logit_roc_auc_train = roc_auc_score(y_train, lg2.predict(X_train2))
fpr, tpr, thresholds = roc_curve(y_train, lg2.predict(X_train2))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Let's see if the f1 score can be improved further, by changing the model threshold using AUC-ROC Curve.
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg2.predict(X_train2))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
# creating confusion matrix
confusion_matrix_statsmodels(
lg2, X_train2, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg2, X_train2, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
y_scores = lg2.predict(X_train2)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_thresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_thresh(prec, rec, tre)
plt.show()
# setting the threshold
optimal_threshold_curve = 0.42
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_train2, y_train, threshold=optimal_threshold_curve)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg2, X_train2, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold (0.5)",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Dropping the columns from the test set that were dropped from the training set
X_test2 = X_test[list(X_train2.columns)]
Using model with default threshold
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_test2, y_test)
log_reg_model_test_perf = model_performance_classification_statsmodels(
lg2, X_test2, y_test
)
print("Test performance:")
log_reg_model_test_perf
logit_roc_auc_train = roc_auc_score(y_test, lg2.predict(X_test2))
fpr, tpr, thresholds = roc_curve(y_test, lg2.predict(X_test2))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Using model with threshold=0.37
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_test2, y_test, threshold=optimal_threshold_auc_roc)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg2, X_test2, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Using model with threshold = 0.42
# creating confusion matrix
confusion_matrix_statsmodels(lg2, X_test2, y_test, threshold=optimal_threshold_curve)
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg2, X_test2, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold (0.5)",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
# testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression-default Threshold (0.5)",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Test set performance comparison:")
models_test_comp_df
Note on Model Performance Summary Table :
Accuracy measures the proportion of correct predictions, while recall measures the proportion of actual positive cases that were correctly identified by the model. Precision measures the proportion of predicted positive cases that were correctly identified, and the F1 score is the harmonic mean of precision and recall.
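As a quick check of those definitions, all four metrics can be computed by hand from the confusion-matrix counts and compared against sklearn (the labels below are a toy example, not the hotel data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# confusion-matrix counts
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
tn = np.sum((y_true == 0) & (y_pred == 0))

acc = (tp + tn) / len(y_true)       # proportion of correct predictions
prec = tp / (tp + fp)               # correct among predicted positives
rec = tp / (tp + fn)                # captured among actual positives
f1 = 2 * prec * rec / (prec + rec)  # harmonic mean of precision and recall

print(acc, prec, rec, f1)  # matches sklearn's metric functions
```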
### Inference
Since the goal is to maximize the F1 score, we choose the threshold that gives the highest F1 score on the validation or cross-validation set. In the comparison table, the F1 score is highest for the logistic regression model with the 0.37 threshold on both the training and test sets.
(The greater the F1 score, the better the model balances false negatives and false positives.)
X = data.drop(["booking_status"], axis=1)
y = data["booking_status"]
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
## Function to calculate recall score
def get_recall_score(model):
'''
model : classifier to predict values of X
'''
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
print("Recall on training set : ",metrics.recall_score(y_train,pred_train))
print("Recall on test set : ",metrics.recall_score(y_test,pred_test))
We will build our model using the DecisionTreeClassifier class, with the default 'gini' criterion for splitting. The other option is 'entropy'.
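For intuition on the two criteria: for a binary node with positive-class proportion p, Gini impurity is $1 - p^2 - (1-p)^2$ and entropy is $-p\log_2 p - (1-p)\log_2(1-p)$; both are 0 for a pure node and maximal at a 50/50 split. A small sketch:

```python
import numpy as np

def gini(p):
    """Gini impurity of a binary node with positive-class proportion p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

def entropy(p):
    """Entropy (in bits) of a binary node with positive-class proportion p."""
    if p in (0, 1):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

print(gini(0.5), entropy(0.5))  # 0.5 1.0 -> maximally impure node
print(gini(1.0), entropy(1.0))  # 0.0 0.0 -> pure node
```

In practice the two criteria usually produce very similar trees; 'gini' is slightly cheaper to compute since it avoids the logarithm.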
model = DecisionTreeClassifier(criterion = 'gini', random_state=1)
model.fit(X_train, y_train)
print("Accuracy on training set : ",model.score(X_train, y_train))
print("Accuracy on test set : ",model.score(X_test, y_test))
#Checking number of positives
y.sum(axis = 0)
What does INN Hotel want?
There are two possible types of losses faced:
Model Evaluation
The model evaluation criterion for the Decision Tree model is the same as specified above for the Logistic Regression model (i.e., maximize the F1 score).
INN Hotels wants to maximize the F1 score in order to create a model that is good at predicting which bookings are likely to be canceled while minimizing the number of false positives and false negatives.
In this context, a false positive would be a prediction that a booking will be canceled, but in reality, the guest does show up for their reservation, causing a loss of revenue. A false negative would be a prediction that a booking will not be canceled, but in reality, it is canceled, resulting in a loss of revenue and reputation. Both false positives and false negatives can lead to losses for the hotel, but in this context, false negatives (canceled bookings that were not predicted) are likely to be more costly for the hotel's reputation.
A false negative (FN) is an error that occurs when a model predicts the negative class for an instance that actually belongs to the positive class.
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test
# Recall on train and test
get_recall_score(model)
Notes :
Before pruning the tree, let's check the important features.
feature_names = list(X_train.columns)
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="red", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Let's print the top-10 most important features in the decision tree
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False).head(n=10))
feature_names = list(X.columns)
print(feature_names)
plt.figure(figsize=(20,30))
tree.plot_tree(model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model,feature_names=feature_names,show_weights=True))
In general, the deeper you allow your tree to grow, the more complex your model becomes, because more splits capture more information about the data; this is one of the root causes of overfitting.
The depth of a decision tree is an important hyperparameter that controls the model complexity. A deeper tree can fit the training data more closely, but it also increases the risk of overfitting
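This trade-off can be made visible by sweeping max_depth and comparing train and test accuracy; the unrestricted tree essentially memorizes the training set. A sketch on a synthetic make_classification dataset (a stand-in, not the hotel bookings data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data, not the hotel bookings
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=1
)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.3, random_state=1)

scores = {}
for depth in [1, 3, 5, 10, None]:  # None lets the tree grow until leaves are pure
    dt = DecisionTreeClassifier(max_depth=depth, random_state=1)
    dt.fit(Xtr, ytr)
    scores[depth] = (dt.score(Xtr, ytr), dt.score(Xte, yte))
    print(depth, scores[depth])  # (train accuracy, test accuracy)
```

Train accuracy rises monotonically with depth, while test accuracy levels off; the gap between the two is the overfitting that pruning is meant to control.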
model1 = DecisionTreeClassifier(criterion = 'gini',max_depth=3,random_state=1)
model1.fit(X_train, y_train)
confusion_matrix_sklearn(model1, X_test, y_test)
decision_tree1_perf_train = model_performance_classification_sklearn(
model1, X_train, y_train
)
decision_tree1_perf_test = model_performance_classification_sklearn(
model1, X_test, y_test
)
print(decision_tree1_perf_train)
print(decision_tree1_perf_test)
# Recall on train and test
get_recall_score(model1)
After pruning the decision tree to a depth of 3, we can see that the accuracy, precision, and F1 score have decreased on the test set. This could imply that the tree was overfitting to the training data before pruning. The model was probably capturing noise or random fluctuations in the training data, which made it perform well on the training data but poorly on the test data.
On the other hand, the model's performance on the training set remains relatively unchanged after pruning, which suggests that pruning may have helped reduce overfitting.
Overall, while pruning has somewhat improved the generalization performance of the decision tree model, it is important to strike a balance between model complexity and accuracy.
plt.figure(figsize=(15,10))
tree.plot_tree(model1,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model1,feature_names=feature_names,show_weights=True))
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(model1.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False).head(10))
importances = model1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10,10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
In the important features of the previous model, no_of_week_nights was included, but here it is not. This is a shortcoming of pre-pruning: we limit the tree before knowing the importance of each feature and split.
That's why we will tune the pre-pruning hyperparameters using grid search; perhaps setting max_depth to 3 is not good enough.
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(1,10),
'min_samples_leaf': [1, 2, 5, 7, 10,15,20],
'max_leaf_nodes' : [2, 3, 5, 10],
'min_impurity_decrease': [0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations
f1_scorer = make_scorer(f1_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=f1_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
confusion_matrix_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, y_train
)
decision_tree_tune_perf_train
confusion_matrix_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test = model_performance_classification_sklearn(
estimator, X_test, y_test
)
decision_tree_tune_perf_test
plt.figure(figsize=(15,10))
tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
# importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(estimator.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False).head(10))
#Here we will see that importance of features has increased
In the important features of the previous model, no_of_adults was lost, but here its importance is back. This shows that hyperparameter tuning using grid search is better than arbitrarily limiting a single hyperparameter.
But post-pruning might give even better results: there is a good possibility that we neglected some hyperparameters in the grid, and post-pruning takes care of all that.
In DecisionTreeClassifier, this pruning technique is parameterized by the
cost complexity parameter, ccp_alpha. Greater values of ccp_alpha
increase the number of nodes pruned
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced")
clf.fit(X_train, y_train)
clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]))
For the remainder, we remove the last element in
clfs and ccp_alphas, because it is the trivial tree with only one
node. Here we show that the number of nodes and tree depth decreases as alpha
increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
When ccp_alpha is set to zero and keeping the other default parameters
of DecisionTreeClassifier, the tree overfits, leading to
a 99% training accuracy and 87% testing accuracy. As alpha increases, more
of the tree is pruned, thus creating a decision tree that generalizes better.
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ',best_model.score(X_train, y_train))
print('Test accuracy of best model: ',best_model.score(X_test, y_test))
Accuracy isn't the target metric for our model; the objective is a higher F1 score.
f1_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = f1_score(y_train, pred_train)
f1_train.append(values_train)
f1_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = f1_score(y_test, pred_test)
f1_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
confusion_matrix_sklearn(best_model, X_train, y_train)
decision_tree_post_perf_train = model_performance_classification_sklearn(
best_model, X_train, y_train
)
decision_tree_post_perf_train
confusion_matrix_sklearn(best_model, X_test, y_test)
decision_tree_post_perf_test = model_performance_classification_sklearn(
best_model, X_test, y_test
)
decision_tree_post_perf_test
plt.figure(figsize=(17,15))
tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
The decision tree still looks somewhat complicated, but it is less complex than the initial tree we had before pruning.
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(best_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
# training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_post_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
# testing performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_post_perf_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
INN Hotels Group's objective is to maximize the F1 score, which measures the balance between precision and recall, meaning it considers both false negatives and false positives. The greater the F1 score, the better the model balances false negatives and false positives.
The decision tree with post-pruning gives the highest F1 score on the test set, approximately 81%.
In terms of INN Hotel's objective of maximizing the F1 score, a higher recall score is preferred as it indicates that the model is able to correctly identify more instances of canceled bookings (true positives) out of all the actual canceled bookings (true positives + false negatives). Therefore, the post-pruning decision tree model with the highest recall score of 0.8558 is more aligned with INN Hotel's objective compared to the other two models.
To build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds for INN Hotel Group, the following steps have been successfully undertaken, based on the collection of bookings data made by customers, and shared in this project.
I shall present only some of the relevant insights related to customer booking trends and cancellations. From the graphical analysis carried out, we observe that:
Travel Period
Based on this information, the INN Hotel Group can adjust their pricing, marketing, and staffing strategies to align with the seasonal trends in the hotel industry. They can offer discounts or carry out promotions during the low season (such as January) to attract customers and maintain their occupancy rates. They may also want to increase their staffing levels during the high season (such as October) to meet the high demand for their services.
2. Market Segment
To leverage this trend, the INN Hotel Group may want to invest more in their online marketing and advertising strategies, such as search engine optimization (SEO), pay-per-click (PPC) advertising, and social media campaigns. They can also offer incentives or discounts to customers who book through their online channels to encourage more bookings through this channel.
3. Booking Lead Time and Guest Stay
4. Booking Trend and Cancellations
Hotel booking cancellations are unavoidable at times for genuine reasons. Based on the analysis of the booking data, here are some potential policies that INN Hotel Group could adopt to manage cancellations and refunds in a profitable manner:
Offer flexible cancellation policies: Consider offering flexible cancellation policies that allow guests to cancel or modify their bookings without penalty up to a certain point before their stay. This can help reduce the hesitation of guests to book early, and also help build customer loyalty.
Use pre-payment: Encourage guests to commit to their bookings by requiring pre-payment for non-refundable bookings. This can help prevent last-minute cancellations and no-shows.
Improve communication: Ensure that the hotel's communication with guests is clear and transparent. Guests should be aware of cancellation policies, payment requirements, and any other relevant details. Policies should also be made clear and detailed on online channels during booking confirmations.
Send reminders: Send guests reminders prior to their arrival date to confirm their bookings and remind them of any relevant policies, such as check-in times and cancellation deadlines.
Monitor booking trends: Regularly monitor booking trends and cancellations to identify any patterns or potential issues. This can help you adjust policies and procedures as needed to reduce cancellations and no-shows.
Invest more in their online marketing and advertising strategies, such as search engine optimization (SEO), pay-per-click (PPC) advertising, and social media campaigns.
Offer loyalty promotions such as room upgrades, free-room giveaways, and meal-plan incentives to attract new and repeat guests.